Current-day genomes bear the mark of the evolutionary processes. One of the strongest
indications is the sequence homology among families of proteins that perform similar biological functions
in different species. The number of proteins in a family can grow over time as genetic information is
duplicated through evolution. We explore how evolution directs the size distribution of these
families. Theoretical predictions for family sizes are obtained from two models, one in which individual
genes duplicate and a second in which the entire genome duplicates. Predictions from these
models are compared with the family size distributions for several organisms whose complete genome
sequence is known. We find that protein family size distributions in nature follow a power-law
distribution. Comparing these results to the model systems, we conclude that genome duplication
is the dominant mechanism leading to increased genetic material in the species considered.
I. INTRODUCTION



Current-day genomes are the result of generations of 
evolution. One of the marks of evolution is the exis- 
tence of protein families. These families comprise groups 
of proteins that share sequence similarity and perform 
similar biological functions. The most likely explanation
for the similarity in sequence and function is that
all the proteins in a family evolved from a single common 
ancestor. 

The size of a family, defined here as the number of 
proteins in a family for a particular species, evolves over 
time through processes that increase the physical size of 
an organism's genome. Genomes in many major lineages 
are thought to have undergone ancient doublings one or
more times. It is thought that genome doubling can
provide an evolutionary advantage by permitting redundant
genes to evolve rapidly and perform different biological
roles, potentially allowing entire pathways to acquire
more specific function.

At finer scales, chromosomal regions or individual
genes may be duplicated or lost through evolution.
Even without physical loss, protein coding regions
may suffer loss of function and cease to be expressed,
leading to the existence of pseudogenes.

Previous studies have detected patterns supporting 
growth and loss of genetic information. Evolutionary 
processes consisting of duplication and mutation can 
introduce long-range, power-law correlations in the
sequences of individual genes; reports of such correlations
in intron-rich regions sparked considerable interest.

In contrast to studies of individual gene sequences, we 
developed a model to explain the evolution of the physical
size of a genome. In our model, a speciation rate
allowed genome size to increase or decrease, and an ex- 
tinction rate removed individual species. The ratio of 
the speciation and extinction rates yielded scaling laws 
for the distribution of genome sizes: exponential scaling 
when the amount of genetic material lost or gained was 
constant, and power-law scaling leading to a self-similar 
distribution when the change in genetic material was
proportional to the existing size. Closed-form approximations
agreed with simulation results and explained observations
reported by others.

Here we use related models to explore the sizes of gene
families. Processes that add and remove genetic material are
presented in Sec. II. In the first model, we assume that 
duplication occurs on the level of individual genes. In 
the second model, we assume that these events dupli- 
cate an entire genome. Closed-form solutions are pro- 
vided for the size distributions of gene families. Next, in 
Sec. III, we present results from analysis of gene families
in sequenced genomes. These results rely heavily
on the clusters of orthologous groups (COGs) database,
which identifies gene families that span eight individual
unicellular species including eubacteria, archaebacteria,
cyanobacteria, and eukaryotes. We discuss which evolutionary
model is most consistent with our observations
in Sec. IV.



II. THEORY 

For a single organism, let $P_n$ be the number of gene
families that contain n genes. The total number of families
is $\sum_n P_n = P_{\rm tot}$. We describe two models for the
increase or decrease of the number of genes in the family.

A. Model I: Gene Duplication 

In Model I, we assume that each gene in the family
evolves independently. Each gene duplicates with rate
$k_+$ and each gene is lost with rate $k_-$. With each generation,
the change in the number of families of size n is

$$\Delta P_n = (n - 1) k_+ P_{n-1} + (n + 1) k_- P_{n+1} - n (k_- + k_+) P_n. \tag{1}$$

After sufficient time, the distribution reaches equilibrium
values. Detailed balance indicates that the number of
families increasing from size n to n + 1 should equal the
number of families decreasing from size n + 1 to size n,

$$n k_+ P_n = (n + 1) k_- P_{n+1}. \tag{2}$$

The resulting expression for the populations is

$$P_n / P_{\rm tot} = (1/n)\, a^n / [-\ln(1 - a)], \tag{3}$$

where we have defined a as $k_+/k_-$. Alternatively, normalizing
by the families with a single member, we have

$$P_n / P_1 = (1/n)\, a^{n-1}. \tag{4}$$
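As a numerical sanity check on the Model I result, the sketch below evaluates the normalized populations and substitutes them back into the rate equation. The function names and the value a = 0.3 are illustrative choices, not part of the original analysis. The check is restricted to n >= 2 because the rate equation as written contains no source term from empty families, so the n = 1 equation is not expected to vanish.

```python
import math

def model1_fraction(n, a):
    """P_n / P_tot for Model I; requires 0 < a < 1 and n >= 1."""
    return (a ** n / n) / (-math.log(1.0 - a))

def model1_rate_of_change(n, a, k_minus=1.0):
    """Right-hand side of the Model I rate equation, evaluated on the
    equilibrium populations. Intended for n >= 2 only."""
    k_plus = a * k_minus
    return ((n - 1) * k_plus * model1_fraction(n - 1, a)
            + (n + 1) * k_minus * model1_fraction(n + 1, a)
            - n * (k_minus + k_plus) * model1_fraction(n, a))

a = 0.3  # illustrative value of k_+/k_-
# The fractions should sum to 1 because sum_n a^n / n = -ln(1 - a).
total = sum(model1_fraction(n, a) for n in range(1, 500))
# The equilibrium populations should make the rate of change vanish.
residuals = [model1_rate_of_change(n, a) for n in range(2, 20)]
# Ratio to single-member families, expected to equal a^(n-1) / n.
ratio_3 = model1_fraction(3, a) / model1_fraction(1, a)
print(total, max(abs(r) for r in residuals), ratio_3)
```

The same two functions can be reused with other values of a between 0 and 1; the sum converges more slowly as a approaches 1.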

In addition to describing dynamics when each gene 
is duplicated individually, this model can also represent 
a system in which large genomic regions are duplicated 
or lost, provided that only one member of the family is 
present in the duplicated region. If, for example, a single 
chromosome is duplicated, this model could apply. 

The populations $P_n/P_1$ predicted by Model I are shown
as black lines in Fig. 1 for three choices of the parameter
a: 0.1 (thin black line), 0.3 (medium black line), and
0.9 (thick black line). As the value of a increases, the
distribution of families shifts to larger sizes. The shape
of the distribution changes from a straight line on the
log-log plot at small n, characteristic of a power-law
distribution, to a curved line at larger n, characteristic of
the faster decay of an exponential distribution.

B. Model II: Genome Duplication 

In Model II, we assume that genome duplication dominates
the evolutionary process. Each genome can double
in size with probability $k_+$ or be reduced by half with
probability $k_-$. Writing the size of a family after j doublings
as $n = 2^j$, the evolution of $P_j$ at each generation
is

$$\Delta P_j = k_+ P_{j-1} + k_- P_{j+1} - (k_+ + k_-) P_j. \tag{5}$$

Again relying on detailed balance, we find that $P_j \propto a^j$,
with $a = k_+/k_-$ as before. For normalization, we assume
that $\sum_j P_j = P_{\rm tot}$, yielding

$$P_j / P_{\rm tot} = (1 - a)\, a^j. \tag{6}$$
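The equilibrium of Model II can also be reached numerically by iterating the rate equation until the populations stop changing. The sketch below is one way to do that under an added assumption that is not stated in the text: the range of j is truncated to 0..j_max with reflecting boundaries (no halving below j = 0, no doubling above j_max). The rates 0.12 and 0.4 are arbitrary illustrative values giving a = 0.3.

```python
def relax_model2(k_plus, k_minus, j_max=30, steps=5000):
    """Iterate the Model II rate equation on j = 0..j_max.

    Assumption (not in the text): reflecting boundaries at both ends.
    Requires k_plus + k_minus <= 1 so populations stay non-negative.
    """
    p = [1.0 / (j_max + 1)] * (j_max + 1)  # start from a flat distribution
    for _ in range(steps):
        new = p[:]
        for j in range(j_max):
            up = k_plus * p[j]          # families moving from j to j + 1
            down = k_minus * p[j + 1]   # families moving from j + 1 to j
            new[j] += down - up
            new[j + 1] += up - down
        p = new
    return p

p = relax_model2(k_plus=0.12, k_minus=0.4)
# Successive populations should differ by the factor a = k_plus / k_minus,
# and the j = 0 population should approach 1 - a.
print(p[1] / p[0], p[2] / p[1], p[0])
```

Writing the update in terms of fluxes between neighbouring values of j conserves the total population exactly, which makes the normalization easy to inspect.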

To change variables from j to n, we make an approximation
that the discrete values of j and n may be replaced
by a continuous distribution. The distribution for
n is then $P_n = P_{j(n)}\, dj(n)/dn$, where $j(n) = \log_2 n$,
giving

$$P_n / P_{\rm tot} = [(1 - a)/\ln 2]\, n^{(\ln a/\ln 2) - 1}. \tag{7}$$

Because we used a continuous distribution to derive this 
result, the normalization is not exact. The power-law 
form of the distribution, however, is accurate, and sim- 
ple summation may be used to define the normalization 
constant. 
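The normalization by summation mentioned above can be sketched as follows. The cutoff n_max is an assumption introduced here to make the sum finite; for a close to 1 the exponent approaches -1 and the result depends noticeably on that cutoff. The function names are illustrative.

```python
import math

def model2_exponent(a):
    """Exponent of n in the Model II power law: (ln a / ln 2) - 1."""
    return math.log(a) / math.log(2.0) - 1.0

def model2_fractions(a, n_max):
    """P_n / P_tot for n = 1..n_max, normalized by direct summation.

    n_max is a hypothetical cutoff on family size, not a quantity
    taken from the paper.
    """
    expo = model2_exponent(a)
    weights = [n ** expo for n in range(1, n_max + 1)]
    norm = sum(weights)
    return [w / norm for w in weights]

a = 0.3
fractions = model2_fractions(a, n_max=1000)
continuum_prefactor = (1.0 - a) / math.log(2.0)  # prefactor from the continuous approximation
summed_prefactor = fractions[0]                  # P_1 / P_tot from summation
print(continuum_prefactor, summed_prefactor)
```

Comparing the two prefactors shows how far the continuous approximation is from an exactly normalized discrete distribution, while the ratio between any two family sizes is the same in both.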

Alternatively, the distribution may be defined relative
to the number of families of size 1, or

$$P_n / P_1 = n^{(\ln a/\ln 2) - 1}. \tag{8}$$

Results for $P_n/P_1$ are shown as grey lines in Fig. 1 for
three values of a: 0.1 (thin grey line), 0.3 (medium grey
line), and 0.9 (thick grey line). As these are power-law
distributions, they are straight on a log-log plot. The
distribution favors larger family sizes as a increases. 
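The difference in shape between the two models can be made concrete by computing the local slope on a log-log plot. The sketch below is illustrative only; the choice of measuring the slope between n and 2n, and the function names, are assumptions made here.

```python
import math

def model1_ratio(n, a):
    """P_n / P_1 for Model I (individual gene duplication)."""
    return a ** (n - 1) / n

def model2_ratio(n, a):
    """P_n / P_1 for Model II (genome duplication)."""
    return n ** (math.log(a) / math.log(2.0) - 1.0)

def loglog_slope(ratio, n, a):
    """Slope of log(P_n / P_1) versus log(n), measured between n and 2n."""
    return (math.log(ratio(2 * n, a)) - math.log(ratio(n, a))) / math.log(2.0)

for a in (0.1, 0.3, 0.9):
    slopes1 = [round(loglog_slope(model1_ratio, n, a), 2) for n in (1, 4, 16)]
    slopes2 = [round(loglog_slope(model2_ratio, n, a), 2) for n in (1, 4, 16)]
    print(a, slopes1, slopes2)
```

Both expressions give the same value a/2 at n = 2, so the two models only separate at larger family sizes: the Model II slope stays constant, while the Model I slope becomes steeper as n grows.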

III. RESULTS 

To investigate the size distributions of gene families
in nature, we analyzed the contents of the COG
database. This database uses essentially unsupervised
sequence-similarity comparisons to group proteins into
families of orthologs and paralogs. The current release
includes 8328 proteins from eight sequenced genomes (E.
coli, H. influenzae, H. pylori, M. genitalium, M. pneumoniae,
Synechocystis, M. jannaschii, and S. cerevisiae)
and assigns them to 864 individual families. Only proteins
with orthologs in at least three species are included
in the database. Using this database, we computed the
number of families of size n, $P_n$, for each species, then
normalized the result by $P_1$ for the same species. The
results of this analysis are shown in Fig. 2.
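The counting step described above can be sketched in a few lines. The input layout used here, a list of (protein, family) pairs for one species, is hypothetical and is not the COG file format; the toy data are invented purely for illustration.

```python
from collections import Counter

def family_size_distribution(assignments):
    """Return {n: P_n / P_1} for one species.

    `assignments` is an iterable of (protein_id, family_id) pairs.
    This layout is a hypothetical stand-in for the database contents.
    """
    genes_per_family = Counter(family for _, family in assignments)
    families_per_size = Counter(genes_per_family.values())  # n -> P_n
    p1 = families_per_size.get(1, 0)
    if p1 == 0:
        raise ValueError("no single-member families; cannot normalize by P_1")
    return {n: count / p1 for n, count in sorted(families_per_size.items())}

# Invented toy data: four families with one member, two with two members.
toy = [("p1", "F1"), ("p2", "F2"), ("p3", "F2"), ("p4", "F3"),
       ("p5", "F4"), ("p6", "F4"), ("p7", "F5"), ("p8", "F6")]
distribution = family_size_distribution(toy)
print(distribution)
```

The first counter gives the number of genes in each family and the second counts how many families have each size, which is the quantity plotted against n.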

As seen in Fig. 2, all the species show power-law behavior
for $P_n/P_1$ as a function of n for families of size 10
or smaller. The linear trend indicates that Model II, du- 
plication of the entire genome, is more likely than Model 
I, in which individual genes are duplicated. 

We explore the linear trend more quantitatively by performing
a least-squares fit of the data for each model.
The quantity we minimize is the RMS error for the
log-transformed data,

$$\mathrm{RMS} = \left\{ \frac{1}{N} \sum_{n:\,P_n \ge 2} \left[ \log_{10}(P_n/P_1)_{\rm data} - \log_{10}(P_n/P_1)_{\rm fit} \right]^2 \right\}^{1/2}, \tag{9}$$

with $(P_n/P_1)_{\rm fit}$ from Eq. (4) or Eq. (8). As noted in the
summation, we considered only family sizes n with $P_n$ equal to
2 or more; the total number of family sizes used is N. The
results of the fit are detailed in Table I, along with the
number of family sizes that contributed to the fit. The
model with the smaller RMS for the fit is also indicated.
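A minimal sketch of such a fit is given below. Both model expressions are linear in a transformed parameter once logarithms are taken (log10 a for Model I, the exponent for Model II), so the least-squares minimum has a closed form; whether the original fits were obtained this way is not stated. The counts used in the example are synthetic, generated from an exact power law with a = 0.5, and the inclusion of n = 1 among the fitted sizes is an assumption.

```python
import math

def fit_models(counts, min_families=2):
    """Fit a for Model I and Model II by least squares on log10(P_n / P_1).

    `counts` maps family size n to the number of families P_n.
    Returns {"I": (a, rms), "II": (a, rms)}.
    """
    p1 = counts[1]
    sizes = [n for n, p in sorted(counts.items()) if p >= min_families]
    if not any(n > 1 for n in sizes):
        raise ValueError("need at least one family size larger than 1")
    y = [math.log10(counts[n] / p1) for n in sizes]
    logn = [math.log10(n) for n in sizes]

    # Model I: y = (n - 1) * log10(a) - log10(n), linear in b = log10(a).
    b = (sum((n - 1) * (yi + li) for n, yi, li in zip(sizes, y, logn))
         / sum((n - 1) ** 2 for n in sizes))
    res1 = [yi - ((n - 1) * b - li) for n, yi, li in zip(sizes, y, logn)]

    # Model II: y = c * log10(n) with c = log2(a) - 1, linear in c.
    c = sum(yi * li for yi, li in zip(y, logn)) / sum(li ** 2 for li in logn)
    res2 = [yi - c * li for yi, li in zip(y, logn)]

    def rms(res):
        return math.sqrt(sum(r * r for r in res) / len(res))

    return {"I": (10.0 ** b, rms(res1)), "II": (2.0 ** (c + 1.0), rms(res2))}

# Synthetic counts following P_n = 3600 / n^2, i.e. Model II with a = 0.5.
counts = {1: 3600, 2: 900, 3: 400, 4: 225, 5: 144, 6: 100}
result = fit_models(counts)
better = min(result, key=lambda model: result[model][1])
print(result, better)
```

On real data the model with the smaller RMS would be reported as the better fit, mirroring the last column of Table I.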

As seen in Table I, Model II (complete genome duplication)
provides a consistently better fit to the data than
does Model I (individual gene duplication). In particular, 
when all of the protein families for a given organism are 
considered, each of the eight organisms shows a better fit 
with Model II than with Model I. 

In Table I the fit values for a are also shown for
the functional classes defined in the COG database:
information storage and processing, cellular processes,
metabolism, and poorly characterized. These individual
classes are also generally fit better by Model II than by
Model I. In E. coli, H. influenzae, H. pylori, M. pneumoniae,
and Synechocystis, at least three of the four classes
are fit better by Model II; in M. genitalium, there are 
not enough protein families for adequate predictions of 
a. Only in S. cerevisiae does Model I appear to provide 
a slightly better fit to the distribution of family sizes 
for two classes, information storage and processing and 
cellular processes. One possible explanation for the bet- 
ter performance of Model I for S. cerevisiae is that gene 
families grow through the duplication of chromosomes, 
rather than the duplication of individual genes or en- 
tire genomes. The distinction between the genome and 
individual chromosomes is not applicable to the other
organisms, which have a single chromosome. 

A trend evident in Table I is that a for cellular processes
(molecular chaperones, outer membrane, cell wall
biogenesis, secretion, motility, inorganic ion transport 
and metabolism) is typically larger than a for infor- 
mation storage and processing (translation, ribosomal 
structure and biogenesis, transcription, replication, re- 
pair, recombination) and for metabolism (energy produc- 
tion and conversion, carbohydrate metabolism and trans- 
port, amino acid metabolism and transport, coenzyme 
metabolism, lipid metabolism). Protein families for cel- 
lular processes are therefore biased towards larger sizes, 
while families for information storage and processing and 
metabolism are biased toward smaller family sizes. This 
would imply that, in either model, a duplication of cel- 
lular process proteins is more likely to be retained than 
duplications of other functions. This suggests that cells 
can tolerate changes to cellular process pathways more 
readily than to other pathways. 

The relative performance of Model I and Model II ac- 
cording to protein family functional class is summarized 
in Table II. When all classes are considered, Model II
clearly provides a better explanation of the observed fam- 
ily sizes. When classes are considered separately, Model 
II provides a better explanation for three classes (infor- 
mation storage and processing, metabolism, and poorly 
characterized functions), while Model I provides a better 
explanation only for cellular processes. 

The fits provided by Model I and Model II are shown 
in Fig. 3 for E. coli and S. cerevisiae. The observed family
size distributions are shown as points and the best fits
as lines, grey for Model I and black for Model II. The top 
pair of panels shows the results when all protein families 
are considered. For families up to size 10, the distribu- 
tions from both organisms clearly follow the power-law 
prediction of Model II. 

For the separate protein classes, the E. coli family sizes 
continue to follow the power-law prediction of Model II. 
As mentioned previously for S. cerevisiae, however, the fit 
to Model II is not good for the information storage and processing and
cellular processes classes. The size distribution decays 
much more rapidly than Model II predicts. 

IV. DISCUSSION 

We have investigated the size distribution of protein 
families. For a selection of single-celled organisms with 
sequenced genomes, we find that the number of fam- 
ilies with n members follows a power-law distribution 
as a function of n. This behavior suggests that evolu- 
tion increases protein diversity through duplication of 
entire genomes, balanced occasionally by the loss of large
amounts of genetic information. It is less likely that pro- 
tein diversity is increased through the duplication of in- 
dividual genes, since this process would not lead to a 
power-law distribution. 

The power law we find is $P_n/P_1 \sim n^{(\ln a/\ln 2) - 1}$, where $P_n$ is
the number of families of size n. The parameter a, which sets the exponent, varies
from 0.2 to 0.6 depending on species. In our theory, this
parameter measures the ratio of the rate of genome duplication
to the rate of gene loss. The behavior we obtain
for all species indicates that the rate of genome duplica- 
tion, relative to the rate of gene loss, is approximately the 
same for each species. This points to the ancient origin 
of the cellular machinery responsible for the duplication 
of DNA. 

Different classes of genes evolve at slightly different 
rates. Families that perform cellular processes tend to 
be larger than average. Supplementing these functions 
might provide a disproportionate selective advantage. 
Also, the remaining functions (information storage and 
processing and metabolism) could represent core cellular 
machinery that is relatively standard and requires less 
variability. 

It would be interesting to verify whether the same pro- 
tein family size distributions are observed in multicellu- 
lar plants and animals. One might expect that genome 
duplication would be supplanted by chromosome duplica- 
tion, which would shift the family size distribution from a 
power law to a steeper, almost exponential decay. Some 
evidence in this direction is already provided with the
S. cerevisiae data presented in Sec. III. With the C. elegans
sequence reported [10], the D. melanogaster sequence
promised within a year, and a rough draft of
the H. sapiens genome imminent [12], this question might
soon be answered.
FIG. 1. The family size distribution $P_n/P_1$ is shown for
three values of a: 0.1 (thin line), 0.3 (medium line), and 0.9
(thick line). Results are displayed both for Model I (grey
lines) and Model II (black lines). Model II, which predicts a
power-law distribution, is linear on a log-log plot.

FIG. 2. The size distributions $P_n/P_1$ of protein families are
shown for the eight organisms included in the COG database. 
The linear trend on the log-log plot is evidence for genome 
duplication being the primary evolutionary mechanism driv- 
ing the growth of gene families. Lines are provided as a guide 
to the eye. 

FIG. 3. The family size distributions $P_n/P_1$ are shown for
protein families in E. coli (left half) and S. cerevisiae (right
half). Also shown are predictions of Model I (gene dupli- 
cations are independent, grey line) and Model II (the entire 
genome duplicates, black line). 



TABLE I. The parameter a as calculated from Model I and
Model II is presented, along with the RMS of the fit, for the
organisms and functional categories in the COG database.

Organism / Functional category    Model I        Model II       N_fit^a   Better
                                  a      RMS     a      RMS
E. coli / All                     0.84   0.39    0.50   0.16    17        II
  Information^b                   0.77   0.55    0.47   0.39     6        II
  Cellular processes              0.84   0.20    0.66   0.12     7        II
  Metabolism                      0.81   0.32    0.53   0.17    10        II
  Poorly characterized            0.89   0.39    0.64   0.25     8        II
H. influenzae / All               0.56   0.22    0.31   0.09     8        II
  Information                     0.36   0.10    0.25   0.12     4        I
  Cellular processes              0.73   0.14    0.54   0.03     4        II
  Metabolism                      0.53   0.16    0.34   0.04     5        II
  Poorly characterized            0.56   0.10    0.41   0.05     5        II
H. pylori / All                   0.54   0.32    0.30   0.13     7        II
  Information                     0.47   0.25    0.30   0.14     5        II
  Cellular processes              0.48   0.01    0.38   0.09     4        I
  Metabolism                      0.33   0.15    0.26   0.08     3        II
  Poorly characterized            0.73   0.39    0.49   0.25     5        II
...
  Cellular processes              0.70   0.03    0.62   0.07     4        I
...
  Metabolism                      0.54   0.23    0.32   0.08     6        II
  Poorly characterized            0.77   0.27    0.55   0.15     8        II
S. cerevisiae / All               ...
  ...                             0.95   0.16    0.92   0.16     5        I
  Metabolism                      0.72   0.14    0.50   0.10     9        II
  Poorly characterized            0.95   0.13    0.85   0.09     8        II

^a N_fit is the number of family sizes used in the fit (all sizes
with 2 or more families).
^b Information storage and processing



TABLE II. The number of organisms for which Model I or
Model II is a better fit is summarized according to protein
functional classes.

Functional class        Model I better   Model II better   Tie
All classes             0                8                 0
Information^a           2                5                 1
Cellular processes      4                2                 2
Metabolism              0                7                 1
Poorly characterized    1                7                 0

^a Information storage and processing
[FIG. 3 panels. Columns: E. coli (left), S. cerevisiae (right). Rows by functional class: All; Information storage and processing; Cellular processes; Metabolism. Horizontal axis: family size n, logarithmic scale from 1 to 100.]